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ABSTRACT 

Using non-conventional markers, DNA metabar- 
coding allows biodiversity assessment from 
complex substrates. In this article, we present 
ecoPrimers, a software for identifying new barcode 
markers and their associated PCR primers. 
ecoPrimers scans whole genomes to find such 
markers without a priori knowledge. ecoPrimers op- 
timizes two quality indices measuring taxonomical 
range and discrimination to select the most efficient 
markers from a set of reference sequences, accord- 
ing to specific experimental constraints such as 
marker length or specifically targeted taxa. The 
key step of the algorithm is the identification of 
conserved regions among reference sequences for 
anchoring primers. We propose an efficient algo- 
rithm based on data mining, that allows the 
analysis of huge sets of sequences. We evaluate 
the efficiency of ecoPrimers by running it on three 
different sequence sets: mitochondrial, chloroplast 
and bacterial genomes. Identified barcode markers 
correspond either to barcode regions already in use 
for plants or animals, or to new potential barcodes. 
Results from empirical experiments carried out on a 
promising new barcode for analyzing vertebrate 
diversity fully agree with expectations based on 
bioinformatics analysis. These tests demonstrate 
the efficiency of ecoPrimers for inferring new 
barcodes fitting with diverse experimental contexts. 
ecoPrimers is available as an open source project at: 
http://www.grenoble.prabi.fr/trac/ecoPrimers. 

INTRODUCTION 

DNA barcoding opens new opportunities for biodiversity 
research. This technique is now considered to be a 



powerful tool, both for taxonomical (1) and ecological 
(2) studies. Taxonomies based solely on morphological 
analyses are sometimes problematic due to either conver- 
gence in phenotypes among distantly related species, or 
the failure to identify cryptic species where morphologic 
divergence has not kept pace with genetic divergence (3). 
Though the original aim of DNA barcoding was to assign 
an unambiguous molecular identifier to each taxon (1), 
today new DNA barcoding applications are emerging. 
These applications apply DNA barcodes not as a means 
to unambiguously identify a single specimen from a taxo- 
nomical point of view, but as a tool for better 
characterizing a set of taxa from a complex biological 
sample. This metabarcoding approach (i.e. the simultan- 
eous identification of many taxa from the same sample) 
has a wide range of applications in forensics, ecology and 
palaeoecology. 

Following the original (sensu stricto) barcode definition, 
a barcode marker must be as universal as possible and 
must contain enough information to discriminate 
between closely related species and to discover new ones. 
The Consortium for the Barcode of Life (CBoL: http:// 
www.barcodeoflife.org) leads the standardization of such 
markers. For example, the COI gene is recommended for 
animal barcoding (1). However, in ecological research, 
other constraints must sometimes be considered when se- 
lecting a barcode marker and its associated primers. As a 
consequence, the standardized COI animal barcode that 
clearly fulfills all the requirements for specimen identifica- 
tion (1) is not always the most efficient one for a 
metabarcoding approach. 

Metabarcoding constraints on the locus choice 

Sensu stricto barcode applications prefer long barcode 
markers with high discrimination capacity and, if 
possible, high phylogenetic information content. For 
these reasons the COI gene for animals (1) and rbcL and 
matK genes for plants (4) are recommended by CBoL. 
Metabarcoding has a different aim and requires different 
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optimality criteria for the markers employed: (i) as the 
DNA will often be degraded (and to minimize the risk 
of chimeric sequences) shorter amplicons are needed, 
and (ii) to minimize amplification biases in mixed-template 
reactions, the primers need to be highly conserved. 
Furthermore, taxonomic resolution at the species level is 
not always required. Identification at a higher taxonomic 
level (e.g. family, order, etc.) is sometimes sufficient. Thus 
in some conditions, it might be necessary to select a short 
marker even if its resolution is low. 

Metabarcoding constraints on the primer choice 

Sensu stricto barcode applications usually rely on PCR 
amplifications from good quality DNA extracted from a 
single specimen. This allows the use of degenerate primers 
and relaxed PCR conditions, with the key constraint of 
amplifying the same highly informative standard locus 
from the broadest range of organisms. A contrario, 
metabarcode applications require robust PCR conditions 
allowing unbiased amplifications from a mix of several 
DNA templates which are often degraded [DNA extracted 
from modern and ancient soils (5,6), water (7) or animal 
feces (8,9)]. This imposes the use of highly conserved 
primers for simplifying PCR amplification conditions 
and reducing disequilibrium in amplification among the 
different DNA templates. Moreover, it can be advanta- 
geous to select primers amplifying only a subset of taxa 
for solving a given biological question (i.e. excluding the 
amplification of other taxonomic groups). 

Tracking the ideal barcode markers 

Ideal metabarcode markers should be short, highly dis- 
criminant, restricted to the studied clades and have 
highly conserved primer sites. Such ideal markers might 
not be the same among studies. In many cases this requires 
a specific pair of primers be designed to exactly fit the 
biological question. 

The traditional method for identifying barcode regions 
is human observation of sequence alignments to locate 
two conserved regions flanking a variable one. This 
manual approach obliges barcode designers to work on 
well-known sets of genes. Based on this approach, 
several manually discovered barcode loci are in routine 
use today, including regions of protein encoding genes 
such as COI (1,12), rbcL or matK (4), RNA genes like 
mitochondrial 12S (13) or 16 S (14) rDNA and non-coding 
chloroplast regions such as the trnL intron (15) or the 
intergenic tmH-psbA region (16). Several tools exist to 
help biologists during the primer design step, but they 
were not often developed for the context of DNA 
barcoding. Among them, Primer3 (17) and QPrimer (18) 
use a single training sequence and were clearly not de- 
veloped for designing versatile primers. TmPrime (19) 
and UniPrimer (20) can work on a training set of short 
sequences (i.e. gene sequences), allowing the design of 
primers that amplify several homologous sequences. But 
these tools are not adapted for long sequences (i.e. whole 
genomes) and do not take into account the taxonomic 
discrimination capacity of the amplified sequence during 



the primer selection process. More interestingly, 
PrimerHunter (21) was developed to select highly specific 
primers for distinguishing virus subtypes, a typical sensu 
lato barcoding application. Unfortunately, its efficiency 
on large data sets of long sequences is problematic. We 
were unable to run it on a 13.7 MB (Megabyte) database 
corresponding to the full set of whole mitochondrial 
genomes extracted from GenBank. Finally, Amplicon 
(22) allows for selecting specific primers to a group of 
aligned sequences and excluding a counterexample data 
set. But, as Amplicon requires aligned sequences, it can 
only design primers from a set of short regions compatible 
with multi-alignment software capacity and so cannot be 
run with a whole-genome data set. 

To efficiently infer new metabarcode markers, we de- 
veloped a software, ecoPrimers, fulfilling the following 
prerequisites: (i) the ability to scan a large database of 
whole genomes allowing the selection of markers 
without a priori identification, (ii) the ability to select 
highly conserved primers among a training set of se- 
quences (example sequences) and possibly not amplifying 
a counterexample set of sequences (iii) the ability to test an 
amplified region for its capacity to discriminate among 
taxa. For achieving these goals, we took advantage of 
two indices previously proposed to evaluate in silico the 
relative quality of barcode primers in the context of 
metabarcoding (10). The first index, B c , estimates the 
coverage or taxonomical amplification range of a primer 
pair. The second, B s , evaluates the taxonomical discrimin- 
ation capacity of the amplified marker among the 
amplified taxa. These indices have been successfully used 
by Bellemain et al. (11) to demonstrate the importance of 
primer selection for metabarcoding studies of fungal 
communities. ecoPrimers selects primer pairs by 
optimizing these two indices. A special effort was made 
to ensure computational efficiency of the program, and 
this was tested on the one thousand bacterial genomes 
currently available in public databases. 

Here we used ecoPrimers to design specific primer pairs 
for bacterial, chloroplast and mitochondrial genomes. 
Validation by empirical experiments of the primer pairs 
selected to identify the vertebrates confirms that 
ecoPrimers proposed specific and robust primer pairs for 
amplifying target sequences. ecoPrimers is available as an 
open source software at: http://www.grenoble.prabi.fr/ 
trac/ecoPrimers. 



MATERIALS AND METHODS 

Problem formulation 

We assume that all sequences are texts over the DNA 
alphabet {A, C, G, T}, and that the orientation of se- 
quences is unknown. Given a set of example sequences 
E s and an optional second set of counterexample se- 
quences C s , we want to identify highly conserved 
primers which are present in the largest possible subset 
of E s and in the smallest subset of C s . Highly conserved 
primers are defined as words of length l p , (i) strictly 
present in at least Q s sequences of E s , (ii) present in at 
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least Q e sequences of E s with no more than e mismatches 
(optionally we can impose that these errors are not located 
in the n last 3' bases of the primers to be more realistic in 
subsequent empirical DNA amplification), (hi) not present 
in more than Q x sequences of C s . The same approximative 
matching conditions used for Q e are applied to this 
quorum. By default Q s is set to 70% of |.E S |, Q e is set to 
90% of \E S \ and Q x is set to 10% of |C,|. Identified po- 
tential primers are then paired with respect to their loca- 
tions and orientation to allow amplification of those DNA 
fragments that are within the size range specified by the 
user. 

Algorithm 

In a nutshell, our method consists of five steps: (i) finding 
strict primers (i.e. without mismatch) from E s respecting 
Q s ; (ii) using these strict primers as models to find their 
non-strict occurrences (i.e. with mismatches) in E s to 
check Q e and in C s to check Q x ; (iii) building the primer 
pairs, (iv) evaluating B c and B s indices to select the best 
primers, and (v) estimating the melting temperature of 
each of the primers in selected pairs. 

Finding strict repeats. Finding conserved regions among a 
set of sequences is an equivalent problem to finding 
repeats among those sequences. Identification of repeats 
in DNA sequences is a well-known problem in bioinfor- 
matics and many efficient data structures and associated 
algorithms exist for finding strict repeats, such as KMR 
(23), suffix tree (24) and suffix array (25). These algo- 
rithms work well on short sequences but are not efficient 
enough for us in terms of memory usage for finding 
repeats in a quorum of a large number of very long se- 
quences (i.e. the set of all whole sequenced bacterial 
genomes available in public databases, approximatively 
1000 genomes and 3 Gb (gigabases) of sequences). The 
best implementation of suffix tree was developed in 
Reputer (26). It uses about 12.5 bytes per nucleotide to 
build the data structure. This compact implementation is 
based on a 32 bit architecture; consequently it cannot ma- 
nipulate sequence data larger than 340 Mb (megabases). 
Similarly, the most compact implementation of KMR is 
done in RepSeek, (27) which uses about 9 bytes per nu- 
cleotide on a 32 bit architecture, corresponding to a limit 
of 475 Mb. The last structure, suffix array, requires 4 bytes 
per nucleotide on a 32 bit, and 4 more bytes to be effi- 
ciently used to infer repeats. These two values have to be 
multiplied by 2 on a 64 bit architecture. Finally, as we do 
not assume that all the sequences are in the same orienta- 
tion, we have to encode the direct and the reverse strand in 
the data, multiplying by two the memory requirement. 

These three algorithms simultaneously identify 
conserved motifs and the positions of their occurrences. 
Following our brief description of the ecoPrimers algo- 
rithm, we just need the motif and the number of the se- 
quences in which they occur. We do not need their exact 
positions, as they will be recomputed in step (ii) taking 
into account mismatches. We take advantage of this to 
gain memory compactness. 



For ecoPrimers we have developed a simple algorithm 
for finding strict repeats which is notably compact in 
memory. This algorithm is based on a sort and a merge 
algorithm and some data mining steps. The algorithm pre- 
sented in Figure 1 (named Strict Primer Algorithm, SPA) 
gives the outline of our strict repeats finding procedure 
without a data mining step. 

In the first step, we load all sequences in memory. Then 
we construct an empty list L P that will contain the strict 
repeats found at the end of the algorithm as a set of couple 
(W, ri) where Wis a word and n is the number of sequences 
where it occurs. In the third step, for each input sequence 
Si of E s , we build L w , the list of all overlapping words of 
length l p . For purpose of compactness, words are saved as 
a 64-bit binary hash code (named further D codc , or R coc u) 
following the encoding schema {A = 00, C = 01, G = 10, 
T= 11}. This allows us to manipulate words up to 32 
nucleotides long. 

To look for repeats in both strands of a DNA sequence, 
standard algorithms are required to store direct and 
reverse sequences in their data structures. In a double 
stranded DNA sequence, occurrence position is defined 
by a position and an orientation. As in our algorithm, 
occurrence positions are not important at this stage, orien- 
tations of enumerated words do not have to be stored. 
Thus, if a word W occurs n times in both strands of a 
sequence, W the reverse complement corresponding 
word of W also occurs « times. ^Therefore we just need 
to count one of the two (W or W). The actual counted 
word for a given word pair (W, W) is the one correspond- 
ing to the smaller hash code between D code and R co de- 

Sorting (Step 7) is achieved using the Smoothsort algo- 
rithm (25,28). This algorithm has a complexity of 0(72log«) 
in the worst case, as do several other sorting algorithms, 
but has a complexity near to 0(n) when the input array is 
almost ordered. 

The merge (Step 9) of the two lists L P and L w is 
achieved in place and in a linear time using just an extra 
buffer of size = minimum(\L P \, \L^). During this merging 
step words that will not be able to respect Q s are 



1 - Load sequences in memory 

2 - Create an empty pattern list Lp of couples (W,n) where 
If is a word and n is the number of sequences where it 
occurs; 

for all sequences Si 6 E s do 

3 - Build empty list of binary words Lyy; 
for all Words W £ S» of length l p , do 

4 - Build D code the hash code of W; 

5 - Build R co de me hash code of W the reverse 
complement of W; 

6 - Append Minimum(D col j e ,R code ) to Lm\ 
end for 

7 - Sort L\y; 

8 - Remove duplicates; 

9 - Merge Liy with Lp updating count n in Lp; 

10 - Remove couple from Lp that cannot meet Q s 
conditions; 

end for 

Figure 1. Strict primer algorithm (SPA) used for finding strict repeats. 
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eliminated of L P . Despite this, the \L P \ increases quickly 
until \E S \ — Q s sequences are analyzed (Figure 2a). This 
technique is sufficient for data sets of reasonable size, 
but for large data sets like fully sequenced bacterial 
genomes having total size of approximately 3 Gb, it 
consumes a significant amount of memory. To overcome 
this problem a pre-filtration/data-mining step was added. 

Data mining. Data mining used for finding strict repeats is 
based on the fact that all words W of size l p present in at 
least Q s sequences of E s are composed only of words W m 
of size l m < l p present in at least Q s sequences of E s . Using 
the binary encoding schema presented previously, we built 
a complete hash table H,„ of all words W m of size /„, = 13. 
Each cell of this table stores the count of sequences 
where the corresponding word occurs. As we have 
4 13 = 67 108 864 different words of size /„„ and for each 
word the hash table used 4 bytes, 256 MB of memory is 
required to store it. This size is small if we compare it to 
the 3 GB used to store the bacterial genome sequences and 
more than 8 GB used by SPA to store the L P list corres- 
ponding to these sequences. H m is built in a linear time. 

To include data mining in SPA, we just added a condi- 
tion on H m in the building hash code methods of Steps 3 



and 4 (Figure 1), verifying the assertion that no word 
W m e W is present in less than Q s sequences. As compu- 
tation of the next hash code at Steps 3 and 4 is achieved by 
bit shifting of the previous one, only one lookup into H m is 
required per hash code generated. Each lookup is done in 
constant time so data mining does not change the global 
complexity of the initial algorithm. 

Finding approximate primers. In the above step we have 
found a list of words L P which are present in at least Q s of 
the E s . In this step, we find the approximate occurrences of 
these words in all the example sequences S e e E s and all the 
counterexample sequences S c e C s . For this purpose, we 
use these strict words as patterns and find their approxi- 
mate occurrences using the agrep algorithm (29). At 
the end, we conserve only words occurring in more 
than Q e sequences of E s with no more than e errors 
(i.e. mismatches). From these words, the words which 
are not present in more than Q x sequences of C s are 
tagged as good primers. 

Pairing the primers. Words must finally be paired to 
delimit potential barcode regions. Pairing is done for all 
the sequences with an almost linear time algorithm 
checking the minimal (/ m i n ) and maximal length (/ max ) 
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time vs sequence count withount data mining (d) 
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Figure 2. Comparison of time and memory usages of the both versions of the SPA. (a) Memory used with respect to the sequences processed 
without data mining step. Memory used increases rapidly until strict quorum (70%) starts taking effect after 271 (30% of 905) sequences have been 
processed (b) Same but with data mining step. Only a small number of prefix of 13 bases for primers of length 18 bases pass the strict quorum, hence 
memory used is significantly small, (c) Time required to process the sequences without data mining increases exponentially until strict quorum starts 
making effect and after that time becomes linear, (d) With the data mining step added, time required becomes linear. 
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constraint imposed on the potentially amplified sequence. 
Each pair must contain at least one good primer (specifi- 
city of a single primer is enough to ensure specificity of the 
amplified region). A primer pairs is composed of two 
words and^their relative orientation indicates which one 
of W and W must be used as primer. Once orientation is 
defined only pairs satisfying the constraint of no 
mismatches on the n last 3' bases of the primer are 
conserved. 

Applying the quality indices. Once constructed, the primer 
pairs can be evaluated using both the indices B c and B s 
defined in Ficetola et al. (10). B c the barcode coverage 
index is the ratio between the number of amplified taxa 
and \E S \. B s the barcode selectivity index is the ratio 
between the number of identified taxa and \E S \. These 
indices can be efficiently computed in ecoPrimers using 
data stored during the pairing process. 

Melting temperature calculation. ecoPrimers uses the 
nearest neighbor thermodynamic model (30) for melting 
temperature (Tm) computation. Using this technique we 
estimate Tm of the perfect match of the primer and of the 
worst match of the primer on the example sequence. The 
temperatures are calculated using the following formula: 



AS +0.368 x N/2 x \n(Na+) + R x ln(C) 

Here, AH and AS" are enthalpy and entropy changes for 
annealing reaction respectively. This annealing reaction 
results in a duplex having Watson-Crick base pairs. N is 
the total number of phosphates in the duplex, R is the 
universal gas constant, C is the total DNA concentration 
from (30) and Na + is the concentration of salt cations. AH 
and AS are computed by summing experimentally 
estimated contributions of constituting dimer duplexes 
as in (21). 

Empirical ecoPrimers evaluation 

ecoPrimers must be evaluated for its computational effi- 
ciency and the quality of its results. Efficiency was tested 
using the large eubact data set {vide infra). The quality of 
the results proposed by ecoPrimers can be checked by 
comparing proposed barcodes with ones currently used. 
If we assume that previously used barcodes were 
designed empirically but correctly, we hope that a subset 
of ecoPrimers results must correspond to them. For this 
purpose three different training data sets and their 
associated parameters were used. 

The eubact data set contains 905 whole eubacteria 
genomes extracted from Genome Review release 115 
(http://www.ebi.ac.uk/GenomeReviews) (31). They cor- 
respond to 603 species belonging to 311 genera. Their 
median size is 3.5 Mb. To identify barcodes similar to 
those used in bacterial biodiversity studies of soil (33), 
ecoPrimers was run on this data set using default param- 
eters and searching for a marker of size smaller than 1 Kb 
(kilobases). The e parameter was set to 3. 

The chloro data set contains 175 whole chloroplast 
genomes extracted from Genbank using eutils web api 



(http://eutils.ncbi.nlm.nih.gov) in January 2010. They cor- 
respond to 174 species belonging to 145 genera. From 
these sequences 119 belong to Tracheophyta (vascular 
plants, NCBI Taxid: 58023) corresponding to 118 species 
in 93 genera. The median size of the 175 sequences is 152 
Kb. In order to find markers useful for environmental 
studies on vascular plant biodiversity (15), ecoPrimers 
was run on this data set with the default parameters, 
searching for markers with a size ranging from 10 bp to 
120 bp. The e parameter was set to 3. The search was 
taxonomically restricted to Tracheophyta. 

The mito data set is composed of 2044 whole mitochon- 
drion genomes extracted from Genbank using eutils web 
api. They correspond to 2002 species belonging to 1549 
genera. Among these sequences 1293 belong to Vertebrata 
(NCBI Taxid: 7742) corresponding to 1261 species in 966 
genera. The median size of the 2044 sequences is 16.6 Kb. 
To search for markers usable in diet analysis studies of 
Carnivora, ecoPrimers was run on this data set with the 
default parameters, looking for markers with a size 
ranging from 50 bp to 120 bp. The e parameter was set 
to 3. On this data set two taxonomical restrictions were 
used. The first restricts the example sequence set E s to 
NCBI Taxid: 7742 (Vertebrata) to optimize primers for 
vertebrates. The second defines the Cs counterexample 
sequence set to NCBI Taxid: 1 (Root) requiring that 
primers not match on sequences belonging to non- 
vertebrates. 

In silico primer checking 

Primers were checked against full Nucleic EMBL 
Standard release 103 database using the electronic PCR 
software ecoPCR (10). The resulting ecoPCR output file 
contains all data about potentially amplified sequences, 
among them the size of the amplicon, the number of 
mismatches associated to each primer and the taxa 
associated with the amplified sequences. 

Empirical primer testing 

Empirical testing was done for only one primer pair, 
named 12S-V5. This primer pair was designed by 
ecoPrimers when run on the mito data set with the 
above mentioned parameters. This primer pair had rea- 
sonably high values of B L . and B s indices with relatively 
short amplification length as shown in Table 3, making it 
suitable for amplification from degraded DNA. 12S-V5 
primer pair was empirically tested in diet analysis of 
three felid species, namely snow leopard (Uncia undo), 
common leopard (Panthera pardus) and leopard cat 
(Prionailurus bengalensis) using feces as a source of 
DNA. The feces sampling was done by field workers of 
The Snow Leopard Trust (http://www.snowleopard.org). 
Snow leopard feces were collected from Mongolia in 2009 
while common leopard and leopard cat feces were col- 
lected from Pakistan in 2008. 

DNA extractions were performed from about 15 mg of 
feces with the DNeasy Blood and Tissue Kit (QIAgen 
GmbH, Hilden, Germany) and recovered in a total 
volume of 250 ul. Amplifications were carried out in a 
final volume of 25 ul, using 2ul of DNA extract as 
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template. The amplification mixture contained 1 U 
AmpliTaq® Gold DNA Polymerase (Applied 
Biosystems, Foster City, CA, USA), 10 mM Tris-HCl, 
50 mM KC1, 2mM MgCl 2 , 0.2 mM of each dNTP, 
0.1 |iM of each primer (12SV05F/R), and 5ug bovine 
serum albumin (BSA, Roche Diagnostic, Basel, 
Switzerland). The PCR mixture was denatured at 95°C 
for lOmin, followed by 45 cycles of 30 s at 95°C, and 
30 s at 60°C; as the target sequences are shorter than 
120 bp, the elongation step was removed to reduce the 
+A artifact (34,35) that might decrease the efficiency of 
the first step of the sequencing process (blunt-end 
ligation). The sequencing was carried out on an 
Illumina/Solexa Genome Analyzer IIx (Illumina Inc., 
San Diego, CA 92121, USA), using the Paired-End 
Cluster Generation Kit V4 and the Sequencing Kit V4 
(Illumina Inc., San Diego, CA 92121, USA), and follow- 
ing manufacturer's instructions. A total of 108 nucleotides 
were sequenced on each extremity of the DNA fragments. 

The sequence reads were analyzed using the OBITools 
software (http://www.prabi.grenoble.fr/trac/OBITools). 
First, the direct and reverse reads corresponding to a 
single molecule were aligned and merged using the 
solexaPairEnd program, taking into account data quality 
during the alignment and the consensus computation. 
Then, primers and DNA tag identifying samples were 
identified using the ngsfilter program. The amplified 
regions, excluding primers, were kept for further 
analysis. Strictly identical sequences were clustered 
together using the obiuniq program. Sequences shorter 
than 10 bp, or containing degenerated IUPAC nucleotide 
codes (other than A, C, G and T), or with occurrence less 
than or equal to 10 were excluded using the obigrep 
program. Taxon assignment was achieved using the 
ecoTag program (9). EcoTag relies on a dynamic 
programming global alignment algorithm (32) to find 
highly similar sequences in the reference database. This 
database was built by extracting the region between the 
two primers 12S-V5 of the mitochondrial 12S gene from 
EMBL nucleotide library using the output of the ecoPCR 
program, allowing a maximum of three mismatches 
between each primer and its target (10). 

All computations were done on a LINUX DELL server 
with 32 GB of RAM (Random Access Memory). 



RESULTS 

Empirical testing of ecoPrimers on a large data set 

The ability of ecoPrimers to analyze full genome data sets, 
allowing it to identify barcodes without a priori targeting 
of any potential locus, relies on its algorithm efficiency. 
Efforts have been made during algorithm conception both 
in terms of memory and time. We have empirically 
estimated the memory requirements of SPA and 
compared it with three algorithms KMR (23), Suffix 
trees (24) and Suffix arrays (25). Memory and time 
complexities were estimated using eubact as data set. 
Size of L P list and computation time was measured after 
each sequence insertion during SPA execution. 



SPA without data mining. The program was first run 
without data mining. Figure 2a displays the evolution of 
L P size. As expected, it increased during the insertion of 
the first 273 sequences. The limit value corresponds to 
I-EjI — 6s + 1- At this point, many words could not reach 
Q s and were discarded from L P . The maximum size of L P 
is about 7.8 GB for 3 Gb of sequences. This corresponds 
to a usage of about 3.6 bytes per nucleotide analyzed on 
both strands, including one byte to store the sequence 
itself. This is already better than the three standard algo- 
rithms, but this transient long list has a drastic impact on 
memory and speed performances. Time evolution during 
execution (Figure 2c) evolves in a quadratic way with the 
sequence count. Theoretically, in the worst case, the algo- 
rithm has a complexity of 0(N 2 ) during this phase, where 
N is count of processed sequences. Then time evolves 
linearly, as \L P \ becomes very small. With eubact data 
set, total time used for the strict primer algorithm is 
about 1 h and 40min. 

SPA with data mining. The experiment was repeated with 
data mining activated. This time the majority of hashed 
words were not included in the L w list because they 
occurred in less than Q s sequences of E s . The effect of 
this reduction of \L W \ is observable on Figure 2b. The 
memory size of L P is never over 2.5 KB (less than 210 
patterns). The global size used with data mining including 
H s , L P , L w and the sequence itself is about 1.1 bytes per 
nucleotide. The second effect of this drastic size reduction 
of L P and L w is the speed increase. With data mining the 
execution time of the strict primer detection is about 5 min 
(2 min for H m building and 3 min for strict primer detec- 
tion). Moreover empirical time complexity is now linear 
with the count of sequences (Figure 2d). 

Global execution. A full search for primers using data 
mining on the eubact data set is about 3 h 40 min. Main 
time is devoted to the agrep algorithm. Execution time of 
this part of our global algorithm is in 0((\E S + \C S \)\L P \). 
On this data set ecoPrimers never used more than 4 GB of 
memory. 

Designed primers. A Eubacteria training data set was used 
to demonstrate efficiency of the algorithm, so primers 
identified with this data set were not checked further. 
The program proposed almost 5521 primer pairs. Out of 
these 5521 primer pairs, we investigated the first few pairs 
and they seem to amplify part of functional RNA genes 
(rRNA 16S gene, rRNA 23S genes). The five pairs are 
presented in Table 1, they all correspond to parts of the 
16S gene. 

Validation of ecoPrimers on vascular plants 

As the majority of already published barcodes for plants 
correspond to regions of the chloroplast DNA (4,15,16), 
we ran ecoPrimers on the chloro data set. Three hundred 
and forty three primer pairs were selected out of 265 273 
primer pairs identified limiting the value of barcode speci- 
ficity to at least 50%. The specified parameters allow the 
selection of markers with properties similar to that of g/h 
primers (15). These primers have already been used for 
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Table 1. The five best primer pairs proposed by ecoPrimers to amplify potential barcode markers specific of eubacteria 



Sequences 




T 


Amplified E s 


B, 




Fra; 


mient 


size (bp) 


Region 


Direct 


Reverse 


PI 


P2 


Min 


Max 


Average 


CGACACGAGCTGACGACA 


CTACGGGAGGCAGCAGTG 


60.5 


60.8 


603 


1.00 


0.927 


668 


987 


699.07 


16S RNA 


CTACGGGAGGCAGCAGTG 


GGTATCTAATCCTGTTTG 


60.8 


47.5 


603 


1.00 


0.910 


392 


708 


417.52 


16S RNA 


CTACGGGAGGCAGCAGTG 


GCGGGCCCCCGTCAATTC 


60.8 


64.9 


603 


1.00 


0.907 


525 


844 


556.49 


16S RNA 


AGCAGCCGCGGTAATACG 


GCGGGCCCCCGTCAATTC 


61.1 


64.9 


603 


1.00 


0.842 


370 


666 


380.21 


16S RNA 


ACCGCGGCTGCTGGCACG 


CTACGGGAGGCAGCAGTG 


69.6 


60.8 


603 


1.00 


0.819 


128 


598 


152.66 


16S RNA 



Amplified E s column indicates electronically amplified species count belonging to the Eubacteria data set. 



Table 2. The five best primer pairs proposed by ecoPrimers to amplify potential barcode markers specific of vascular plants 



Primer name 


Sequences 


T, 


11 


Amplified E s 


B, 




Fra; 


*ment 


size (bp) 


Region 


Direct Reverse 


PI 


P2 


Min 


Max 


Average 


similar to g/h 


GGCAATCCTGAGCCAAAT TGAGTCTCTGCACCTATC 


56.1 


53.5 


1 14 


0.966 


0.711 


10 


90 


45.65 


(iviL-P6-loop 


similar to g/h 


ATTGAGTCTCTGCACCTA GGGCAATCCTGAGCCAAA 


52.7 


58.4 


1 14 


0.966 


0.658 


13 


93 


48.65 


(iviL-P6-loop 


similar to g/h 


AGCTTCCATTGAGTCTCT GGGCAATCCTGAGCCAAA 


53.0 


58.4 


111 


0.941 


0.649 


20 


100 


55.96 


(mL-P6-loop 




TGGTTATTTACTAAAATC TTTGGTTAAGATATGCCA 


41.9 


48.9 


1 16 


0.983 


0.647 


100 


103 


100.3 


psb CL 




GCAATCCTGAGCCAAATC GCTTCCATTGAGTCTCTG 


54.8 


53.4 


112 


0.949 


0.652 


17 


97 


52.73 


trriL 



g/h primers were proposed by Taberlet el ah (15) for vascular plant identification. Amplified E„ column indicates electronically amplified species 
count belonging to the vascular plant example set. 



Table 3. The five best primer pairs proposed by ecoPrimers to amplify potential barcode markers specific of vertebrates 



Primer Name Sequences T m Amplified B c B s Fragment size (bp) Region 





Direct 


Reverse 


PI 


P2 


E, 


c s 






Min 


Max 


Average 






ACTGGGATTAGATACCCC 


TAGAACAGGCTCCTCTAG 


52.6 


52.3 


1221 


31 


0.968 


0.858 


85 


117 


105.38 


16S RNA 


12S-K5 


TAGAACAGGCTCCTCTAG 


TTAGATACCCCACTATGC 


52.3 


50.7 


1236 


7 


0.980 


0.720 


73 


110 


98.32 


12S RNA 




AGGGATAACAGCGCAATC 


TCGTTGAACAAACGAACC 


55.6 


54.4 


1256 


18 


0.996 


0.459 


63 


84 


82.03 


12S RNA 


similar to 16Sr 


CTCCGGTCTGAACTCAGA 


GATGTTGGATCAGGACAT 


56.1 


52.1 


1253 


59 


0.994 


0.196 


53 


59 


58.22 


16S RNA 




ATGTTGGATCAGGACATC 


CTCCGGTCTGAACTCAGA 


52.1 


56.1 


1253 


35 


0.994 


0.195 


54 


60 


57.22 


16S RNA 



16Sr primers were proposed by Palumbi et al. (14) for mammal identification (37). Amplified E, and C s columns indicate electronically amplified 
species counts belonging respectively to the vertebrate example set and to the non-vertebrate counterexample set. 



several metabarcoding applications, such as diet analysis 
(9,36) or to reconstruct past arctic vegetation (6). Table 2 
presents the five primers pairs selected from five best 
regions identified by ecoPrimers. Not only did 
ecoPrimers identify primers similar to g/h as expected, 
amplifying the same trnh P6-loop, but it ranked them 
with the best mark. Most of the primer pairs amplify 
regions of functional RNA genes, or of introns. (34 
primers amplify regions of trriL, 41 primers amplify 
regions of trnW, 11 primers amplify regions of trnY and 
13 primer amplify regions of trnW. Finally 231 primer 
pairs amplify regions of protein coding genes including 
psaB, psaA, psbA, psbC and the intergenic region of 
psbL and psbF). 

Validation of ecoPrimers on vertebrates 

In a similar way as we did for vascular plants, we ran 
ecoPrimers on the mito data set, asking for primers amp- 
lifying only Vertebrata. 



Designed primers. Forty-two primer pairs were identified. 
As for previous tests, they were mainly located on 
non-protein coding sequences (30 in rRNA 16S gene, 12 
in rRNA 12S gene). The five best primer pairs are pre- 
sented in Table 3. The first of them, named 12S-V5, was 
more carefully checked using bioinformatics and experi- 
mental approaches (see below). The third and fourth cor- 
respond to variants of primers amplifying a region of the 
16S rRNA gene already proposed as barcode marker for 
mammals (14,37) 

Bioinformatics validation of the 12S-V5 primer pair. The 
12S-V5 primer pair amplifies a part of the 12S rRNA gene 
including its V5 variable region. The amplified region 
from the ecoPrimers results range from 73 bp to 110 bp. 
It is able to amplify 98% of the sequence training set 
(B c = 0.98) and unambiguously identifies 74% of those 
amplified species (B s = 0.74). Only 7 taxa of over 741 rep- 
resented in the counterexample set of sequences C s are 
recognized by this primer pair. Better estimation of the 
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quality of this barcode was achieved using ecoPCR against 
EMBL nucleotide database (10). We set ecoPCR param- 
eters to allow in silico PCR amplification ranging from a 
size between 50 bp to 250 bp with no more than 3 
mismatches per primer. It resulted in the potential ampli- 
fication of 17737 sequences of vertebrate (according to the 
EMBL annotation) and only 79 sequences belonging to 
other taxa. Of these non-vertebrate sequences, 66 of 
them belong to the Crustacea (NCBI Taxid: 6657), 5 
belong to Insecta (NCBI Taxid: 50557), 3 belong to 
Arthropoda (NCBI Taxid: 6656) and 1 sequence belongs 
to each of the following taxa: Gastropoda (NCBI Taxid: 
6448), Lineidae (NCBI Taxid: 6222), Loxosomatidae 
(NCBI Taxid: 231594). All these non-vertebrate taxa 
present two or three mismatches with both primers. The 
two last non-vertebrate sequences exhibit zero or one 
mismatch for both primers but they correspond to 
mis-assigned taxa. The first one embl:EU626452, 
annotated as an uncultured bacterium (NCBI Taxid: 
77133), is identical to a human sequence. The second 
one embl:AF257243, annotated as a nematode 
{Onchocerca volvulus NCBI Taxid: 6282), is similar to 
many bony fish (Actinopterygii NCBI Taxid: 7898) se- 
quences. The amplified vertebrate sequences correspond 
to 5926 species and 2732 genera. Among them 4537 
species (B s = 0.77) and 2430 genera (B s = 0.89) are unam- 
biguously identified. Among the 17737 sequences of ver- 
tebrate only 353 have two or three mismatches with the 
both primers. A total of 266 of them belong to reptiles 
(Sauropsida NCBI Taxid: 8457), 24 sequences belong to 
amphibians (Amphibia NCBI Taxid: 8292) and 3 



Table 4. Number of vertebrate species exhibiting from 0 to 3 
mismatches for forward and reverse 12S-V5 primers 



Number of mismatches 



Number of species 



Forward primer 



Reverse primer 



3272 
2031 
465 
158 



4592 
1021 
291 
20 



sequences belong to the Batrachoididae family (NCBI 
Taxid: 8065). The 60 remaining sequences belong to 
mammals (NCBI Taxid: 40674) but most of these 
sequences are annotated as a nuclear copy of this mito- 
chondrial locus. Table 4 resumes the distribution of 
mismatches of the two 12S-V5 primers among vertebrate 
species. 

Experimental validation of primer 12S-V5. The empirical 
testing of the 12S-V5 primer pair was carried on felid 
feces, to assess their diet. One, one and two feces were 
used for snow leopard (U. uncia), common leopard 
(P. pardus) and leopard cat (P. bengalensis), respectively. 
The results are summarized in Table 5. As expected, both 
felid (i.e. predator) and the prey sequences were obtained. 
The B s of the amplified sequences allowed us to unam- 
biguously distinguish the three predators, and to identify 
different prey, including three mammals, one bird and one 
amphibian. 



DISCUSSION 

In this article, we have clearly demonstrated the ability of 
the ecoPrimers software to fulfill all the requirements for 
designing new barcode regions suitable for metabarcoding 
studies. This software has the ability to scan large training 
databases (example and counterexample sets) so as to 
design highly conserved primers that have the potential 
to amplify a variable DNA region. The ranking of the 
primer pairs is based on the two previously proposed 
indices B c and B s (10) that evaluate the taxonomic range 
potentially amplified by a primer pair, and the discrimin- 
ation capacity of the amplified region, respectively. A 
large set of parameters can be specified for tuning the al- 
gorithm, including (i) the maximum number of errors 
allowed between each primer and the target sequence, 
(ii) the possibility to restrict the search to a given taxo- 
nomic level (example set), (hi) the possibility to define a set 
of counterexample taxa that the primers should not 
amplify (within or outside of the clade used for the 
search), (iv) the minimum and maximum length of 
the amplified region, (v) the possibility to consider that 
the database sequences are circular, (vi) the possibility to 



Table 5. Count of sequences observed per sample after Solexa sequencing of 4 PCR amplicons 



Feces 







Common leopard 


Snow leopard 


1 


Leopard cat 

2 


Predator 


Common leopard {P. pardus) 


2460 










Snow leopard (U. uncia) 




10807 








Leopard cat (P. bengalensis) 






1982 


9765 


Prey 


Domestic goat (Capra hircus) 


2969 










Siberian ibex (Capra sibirica) 




1256 








Shrew (Crocidura pullata) 








964 




Chukar partridge (Alectoris clntkar) 






1711 






Muree hill frog (Paa vicina) 








982 



Each of them corresponds to one predator feces. 
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require a strict match on a specified number of nucleotides 
on 3'-end of the primers, (vii) the proportion of strict 
matching primers on the example set, (viii) the proportion 
of primers matching with specified number of errors on 
the example set, (ix) the proportion of primers matching 
the counterexample dataset, and finally (x) the possibility 
of avoiding primers matching more than once in one 
sequence of the example set. The efficiency of ecoPrimers 
has been successfully validated, both via bioinformatics 
analyses and via empirical experiments. 

The main advantage and the originality of ecoPrimers 
is its full integration of the taxonomy. This characteris- 
tic has been implemented in a way that allows the 
design of new barcodes specific to any taxonomic 
group, as well as the optional exclusion of any other 
clades. For example, if analyzing the fish diet of an 
otter (genus Lutra) using their feces, it is possible with 
ecoPrimers to design a short barcode that includes all 
teleost fish (Teleostei) and excludes the genus Lutra; 
such a strategy will not only promote prey DNA amp- 
lification, but also prevent otter DNA amplification. 
Another key advantage is the speed efficiency of the 
ecoPrimers algorithm when it is used on whole mito- 
chondrial or chloroplast genomes as example sets, and 
its ability to run on other huge data sets like whole 
eubacteria genomes. 

ecoPrimers is particularly useful for setting up the 
analysis of environmental samples using a metabarcoding 
approach. In such a situation, to avoid amplification bias 
among the different taxonomic groups, it is extremely 
important to work with highly conserved primers. 
Unfortunately, for higher taxonomic group (e.g. verte- 
brate, angiosperms) it is impossible to find primer 
pairs amplifying all species without mismatch (B c ) and 
with a good specificity (B s ). So we cannot exclude that 
some species could be missed by a primer pair. To 
limit potential problems related to relatively low 
coverage of a primer pair, it could be useful to analyze 
the same sample with several markers targeting the same 
taxonomic group. 

The possibility to choose the length of the barcode is 
crucial when working with degraded DNA: in such a 
context only fragments shorter than 100 bp can be 
reliably amplified. According to our experience, in some 
taxonomic groups, it is even possible to design extremely 
short barcodes that nevertheless have a very high coverage 
and specificity. This is the case for earthworms 
(Lumbricina) where a 30 bp barcode located on the mito- 
chondrial 16S gene allows the identification of all species 
from the French Alps analyzed up to now (Bienert et ah, 
submitted for publication). Even when using good quality 
DNA, the length of the sequence reads obtained from the 
DNA sequencer might impose a maximum length when 
designing new barcodes. The current standardized 
barcodes for animals (38) and plants (4) were designed 
according to the technological characteristics of the 
Sanger sequencing using capillary electrophoresis 
(sequence reads shorter than 1 kb). In the near future, if 
the read length of next generation DNA sequencers in- 
creases to several kilobases, it might be worthwhile to 
redesign much longer barcodes to significantly increase 



the taxonomic resolution. As more and more whole mito- 
chondrial and chloroplast genomes become available, 
ecoPrimers has the potential to provide new optimal 
barcodes. 

The majority of barcodes proposed by ecoPrimers 
for Eubacteria, vascular plants and vertebrates are 
located on ribosomal DNA. The only exception was on 
chloroplast DNA, with a few primers located either on 
transfer RNA or on protein genes. As a consequence, 
the example set of sequences can be taxonomically 
enlarged by only taking into account the ribosomal 
genes, and not the whole mitochondrial or chloroplast 
genomes. In the same way, if the goal is to design a 
nuclear barcode, the nuclear ribosomal genes can be effi- 
ciently used as the example set. 

According to our experience, it is sometimes difficult to 
find suitable short barcodes for some taxonomic groups, 
particularly if they diverged a very long time ago. Usually, 
the higher the taxonomic level considered, the greater the 
difficulty to find universal barcodes. If such a problem 
occurs, we advise first modifying the parameters by 
relaxing as much as possible the different constraints, 
and then trying to design several barcodes, one for each 
of the taxonomic groups at a lower level. The other option 
is to degenerate the proposed primers to enlarge their 
taxonomic coverage. Combined use of ecoPrimers and 
ecoPCR (10) is convenient for this purpose. 

As more and more sequences become available in public 
databases, by using larger example sets, ecoPrimers will be 
more and more efficient for designing new barcodes that 
can be precisely optimized according to the biological 
question and to the experimental constraints. The bio- 
logical question might impose a particular level of speci- 
ficity (e.g. species level), or conversely a broad taxonomic 
range, but with a resolution at the family level. The ex- 
perimental constraints might concern the length of the 
barcode, or the avoidance of amplifying another 
non-target taxonomic group. The analysis of environmen- 
tal samples using next generation sequencers is already 
frequently used for estimating the diversity of bacteria, 
e.g. (33), fungi, e.g. (39), and more recently of nematodes, 
e.g. (40). There are more and more research projects ex- 
tending the approach to other taxonomic groups. In such 
a context, the availability of a program allowing the 
design of the most suitable barcode will probably 
enhance studies analyzing the biodiversity of environmen- 
tal samples. ecoPrimers is available as an open source 
software at: http://www.grenoble.prabi.fr/trac/ 
ecoPrimers. 
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